A. Setup and Primer

A.1. Load Packages

First, we will install and load packages that we need for this tutorial. In order to use the tidycensus package, you will need to sign up for a Census API key: https://api.census.gov/data/key_signup.html

#install.packages("tidyverse")
#install.packages("readr")
#install.packages("sf")
#install.packages("tidycensus")
#install.packages("patchwork")

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(tidycensus)
library(ggspatial) # north arrow and scale bar
library(patchwork) # combine multiple maps
# library(ipumsr)

# census_api_key("INSERT API KEY", install = TRUE)
#this is christina's but delete later
# census_api_key("0d1515194f46f009f34f94afcbf045315abdbfbd", install = TRUE)

A.2. Primer to Tutorial Syntax

When referring to a new function for the first time, we will use the double colon operator (::) to specify the source package of the function. For example, dplyr::filter() refers to the filter() function from the dplyr package, which is used to subset rows in a data frame that meets certain conditions (i.e., where Species is setosa).

data(iris) # This loads the iris dataset

dplyr::filter(iris, Species == 'setosa')
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa

In this tutorial, we will use pipes (%>%) from the magrittr package in order to clearly express sequences of steps. Piping allows us to read the code from top –> bottom rather than inside –> out.

# Without pipes
head(dplyr::select(dplyr::filter(iris, Species == 'setosa'), Sepal.Length, Sepal.Width))
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9
# With pipes
iris %>% 
  dplyr::filter(Species == 'setosa') %>% 
  dplyr::select(Sepal.Length, Sepal.Width) %>% 
  head()
##   Sepal.Length Sepal.Width
## 1          5.1         3.5
## 2          4.9         3.0
## 3          4.7         3.2
## 4          4.6         3.1
## 5          5.0         3.6
## 6          5.4         3.9

B. Explore Available Census Data

You can get a list of variable names and descriptions from the U.S. Census using the load_variables() function from the tidycensus package. Since we are interested in the year 2021, let’s use the American Community Survey (vs. the Decennial Census data). We will use the 5-year ACS estimates and filter to variables that are available at the Census tract level.

variables_2021_tract <- tidycensus::load_variables(year = 2021, dataset = "acs5", cache = TRUE) %>% 
  dplyr::filter(geography == "tract")
variables_2021_tract %>% head()
## # A tibble: 6 × 4
##   name        label                                   concept          geography
##   <chr>       <chr>                                   <chr>            <chr>    
## 1 B01001A_001 Estimate!!Total:                        SEX BY AGE (WHI… tract    
## 2 B01001A_002 Estimate!!Total:!!Male:                 SEX BY AGE (WHI… tract    
## 3 B01001A_003 Estimate!!Total:!!Male:!!Under 5 years  SEX BY AGE (WHI… tract    
## 4 B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years   SEX BY AGE (WHI… tract    
## 5 B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE (WHI… tract    
## 6 B01001A_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE (WHI… tract

The str_detect() function from the stringr package allows us to search text strings for key words. We can use it to find the variable names related to median household income and educational attainment.

variables_2021_tract %>% 
  dplyr::filter(concept %>% stringr::str_detect('MEDIAN INCOME'))
## # A tibble: 25 × 4
##    name       label                                            concept geography
##    <chr>      <chr>                                            <chr>   <chr>    
##  1 B06011_001 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  2 B06011_002 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  3 B06011_003 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  4 B06011_004 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  5 B06011_005 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  6 B07011_001 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  7 B07011_002 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  8 B07011_003 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
##  9 B07011_004 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
## 10 B07011_005 Estimate!!Median income in the past 12 months -… MEDIAN… tract    
## # ℹ 15 more rows
variables_2021_tract %>% 
  dplyr::filter(concept %>% stringr::str_detect('EDUCATION'))
## # A tibble: 498 × 4
##    name       label                                            concept geography
##    <chr>      <chr>                                            <chr>   <chr>    
##  1 B06009_001 Estimate!!Total:                                 PLACE … tract    
##  2 B06009_002 Estimate!!Total:!!Less than high school graduate PLACE … tract    
##  3 B06009_003 Estimate!!Total:!!High school graduate (include… PLACE … tract    
##  4 B06009_004 Estimate!!Total:!!Some college or associate's d… PLACE … tract    
##  5 B06009_005 Estimate!!Total:!!Bachelor's degree              PLACE … tract    
##  6 B06009_006 Estimate!!Total:!!Graduate or professional degr… PLACE … tract    
##  7 B06009_007 Estimate!!Total:!!Born in state of residence:    PLACE … tract    
##  8 B06009_008 Estimate!!Total:!!Born in state of residence:!!… PLACE … tract    
##  9 B06009_009 Estimate!!Total:!!Born in state of residence:!!… PLACE … tract    
## 10 B06009_010 Estimate!!Total:!!Born in state of residence:!!… PLACE … tract    
## # ℹ 488 more rows

We can also use the RStudio IDE to search the variables.

variables_2021_tract %>% view()

C. Importing Data

C.1. Recent Census data

C.1.1. A snapshot of 2021

To download ACS data using the get_acs() function, we need three pieces of information.

  • First, what do we want our geography to be? What’s our unit of analysis? For this we have decided to use the Census tracts of New York State. See a full list of available geographies here: https://walker-data.com/tidycensus/articles/basic-usage.html#geography-in-tidycensus

  • Second, what variables do we want? We can pull those directly in using the same command line. We can even give it our own titles.

  • Third, what year of data do we want? There’s room for that too!

Let’s use median income in the past 12 months (B06011_001) and educational attainment (B06009_001, B06009_002, B06009_003, B06009_004, B06009_005).

acs_data <- get_acs(geography = "tract",
                    variables = c(hhincome = "B06011_001", 
                                  total_education = "B06009_001", 
                                  education_lessthanhs = "B06009_002", 
                                  education_hs = "B06009_003", 
                                  education_somecollege = "B06009_004", 
                                  education_bachelors = "B06009_005"),
                    state = "NY",
                    year = 2021,
                    output = "wide")
## Getting data from the 2017-2021 5-year ACS
# Convert educational attainment counts to proportions
acs_data <- acs_data %>%
  mutate(
    perc_lessthanhs = (education_lessthanhsE / total_educationE),
    perc_hs = (education_hsE / total_educationE),
    perc_somecollege = (education_somecollegeE / total_educationE),
    perc_bachelors = (education_bachelorsE / total_educationE)
  )

#the above command brings in the data as a dataframe, which means there is no geometry attached
class(acs_data)
## [1] "tbl_df"     "tbl"        "data.frame"

Let’s begin by exploring the data. We can do that with summary() or a variety of data visualization techniques.

A histogram visually represents the distribution of a dataset by grouping data into bins and displaying the frequency of data points within each bin.

#Explore data
summary(acs_data[, c("perc_lessthanhs", "perc_hs", "perc_somecollege", "perc_bachelors")])
##  perc_lessthanhs      perc_hs       perc_somecollege perc_bachelors  
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.05360   1st Qu.:0.1922   1st Qu.:0.1892   1st Qu.:0.1301  
##  Median :0.09948   Median :0.2676   Median :0.2495   Median :0.1919  
##  Mean   :0.12877   Mean   :0.2622   Mean   :0.2469   Mean   :0.2035  
##  3rd Qu.:0.17491   3rd Qu.:0.3357   3rd Qu.:0.3064   3rd Qu.:0.2640  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :0.7143  
##  NA's   :115       NA's   :115      NA's   :115      NA's   :115
#Histogram for distribution of pooulation with less than high school education across census tracts
plot_hist <- ggplot(acs_data, aes(x = perc_lessthanhs)) +
  geom_histogram(binwidth = 0.03, fill = "lightblue", color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Population with Less Than High School Education",
       x = "Percentage",
       y = "Frequency")

plot_hist
## Warning: Removed 115 rows containing non-finite outside the scale range
## (`stat_bin()`).

A box plot, or box-and-whisker plot, displays the distribution of a dataset by showing its median, quartiles, and potential outliers.

# Boxplot for percentage of population with a  high school degree across census tracts
plot_box <- ggplot(acs_data, aes(y = perc_lessthanhs)) +
  geom_boxplot(fill = "blue", color = "black", alpha = 0.5) +
  theme_minimal() +
  labs(title = "Boxplot of Population with Less Than High School Education",
       y = "Percentage")

plot_box
## Warning: Removed 115 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

A bar plot represents categorical data with rectangular bars, where the height or length of each bar is proportional to the value it represents.

#Bar plots for percentages

# Reshape the data for easier plotting
education_data <- acs_data %>%
  select(GEOID, perc_lessthanhs, perc_hs, perc_somecollege, perc_bachelors) %>%
  pivot_longer(cols = starts_with("perc_"), names_to = "education_level", values_to = "proportion")

print(head(education_data))
## # A tibble: 6 × 3
##   GEOID       education_level  proportion
##   <chr>       <chr>                 <dbl>
## 1 36001000100 perc_lessthanhs       0.246
## 2 36001000100 perc_hs               0.283
## 3 36001000100 perc_somecollege      0.234
## 4 36001000100 perc_bachelors        0.155
## 5 36001000201 perc_lessthanhs       0.117
## 6 36001000201 perc_hs               0.223
summary(education_data$proportion)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.1212  0.2096  0.2103  0.2900  1.0000     460
# Bar plot for educational attainment percentages
plot_bar <- education_data %>% 
  dplyr::group_by(education_level) %>% 
  dplyr::summarize(mean_proportion = mean(proportion, na.rm = TRUE)) %>% 
  ggplot(aes(x = education_level, y = mean_proportion, fill = education_level)) +
  geom_bar(stat = 'identity', color = "black", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Educational Attainment Proportions",
       x = "Education Level",
       y = "Proportion") +
  theme(legend.position = "none")

plot_bar

Now first we downloaded the ACS data using tidycensus, but it was just a dataframe. This means there was no spatial component attached. The good news about get_acs() is that we can also ask it to bring in geometries – an essential tool to make a map!

#to also bring in geometries, modify the command to bring in geometries
acs_data_geo <- get_acs(
    geography = "tract",
    variables = c(hhincome = "B19013_001",
                  total_education = "B16010_001", 
                  education_lessthanhs= "B16010_002", 
                  education_hs = "B16010_015", 
                  education_somecollege = "B16010_028", 
                  education_bachelors = "B16010_041"),
    state = "NY",
    year = 2021,
    output = "wide", 
    geometry = TRUE) %>%
  dplyr::transmute(
    GEOID = GEOID,
    hhincome = hhincomeE,
    perc_lessthanhs = (education_lessthanhsE / total_educationE),
    perc_hs = (education_hsE / total_educationE),
    perc_somecollege = (education_somecollegeE / total_educationE),
    perc_bachelors = (education_bachelorsE / total_educationE)
  )
## Getting data from the 2017-2021 5-year ACS
## Downloading feature geometry from the Census website.  To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
## 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |=                                                                     |   1%
  |                                                                            
  |=                                                                     |   2%
  |                                                                            
  |==                                                                    |   3%
  |                                                                            
  |===                                                                   |   4%
  |                                                                            
  |====                                                                  |   5%
  |                                                                            
  |====                                                                  |   6%
  |                                                                            
  |=====                                                                 |   7%
  |                                                                            
  |======                                                                |   8%
  |                                                                            
  |======                                                                |   9%
  |                                                                            
  |=======                                                               |  10%
  |                                                                            
  |========                                                              |  11%
  |                                                                            
  |========                                                              |  12%
  |                                                                            
  |=========                                                             |  13%
  |                                                                            
  |===========                                                           |  16%
  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=============                                                         |  19%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |==============                                                        |  21%
  |                                                                            
  |================                                                      |  22%
  |                                                                            
  |================                                                      |  23%
  |                                                                            
  |=======================                                               |  32%
  |                                                                            
  |=============================                                         |  42%
  |                                                                            
  |===================================                                   |  51%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |================================================                      |  69%
  |                                                                            
  |=======================================================               |  78%
  |                                                                            
  |=============================================================         |  87%
  |                                                                            
  |====================================================================  |  97%
  |                                                                            
  |======================================================================| 100%
#using base R to plot

#NY state
plot(st_geometry(acs_data_geo), col = NA, border = 'grey')

C.1.2. Range of Data: Median Household Income from 2009 to 2021

# Define the years of interest
years <- 2009:2021 

#define the variables from above

variables <- c(hhincome = "B19013_001")

#download data for multiple years

# Function to download ACS data for a given year
get_acs_data <- function(year) {
  get_acs(
    geography = "tract",
    variables = variables,
    state = "NY",
    year = year,
    output = "wide",
    geometry = TRUE
  ) %>%
    mutate(year = year)  # Add a column for the year
}

# Download data for all years and combine into one data frame
acs_data_list <- lapply(years, get_acs_data)
acs_data_all_years <- bind_rows(acs_data_list)

View(acs_data_all_years)

# Check the structure of the combined data
str(acs_data_all_years)

#plotting median household income over the years

#there are over 4,000 census tracts so let's aggregate and plot averages

average_income <- acs_data_all_years %>%
  group_by(year) %>%
  summarise(avg_hhincome=mean(hhincomeE, na.rm=TRUE))

income_overtime <- ggplot(average_income, aes(x=year, y=avg_hhincome))+
  geom_line(color="blue") +
  geom_point(color="blue")+
  theme_minimal()+
  labs(title="Average Median Household Income Over Time in NY (2009-2021)", 
       x="Year", 
       y= "Average Median Household Income")

income_overtime

You can also map these! But we’ll do this later.

C.2. PLACES Data (health outcomes)

The PLACES data was prepared by the Centers for Disease Control and Prevention (CDC), Robert Wood Johnson Foundation, and the CDC Foundation.

places_prelim_df <- readr::read_csv("~/Library/CloudStorage/OneDrive-ColumbiaUniversityIrvingMedicalCenter/PhD Admin/Projects Collaborations/R Consortium with Stephen/places2023.csv")
## Rows: 2555113 Columns: 23
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): StateAbbr, StateDesc, CountyName, CountyFIPS, LocationName, DataSo...
## dbl  (5): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, Tot...
## lgl  (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

D. Cleaning Data

C.1. PLACES

places_prelim_df %>% head() %>% View()

places_cvd <- places_prelim_df %>% 
  dplyr::filter(MeasureId == 'CHD' & StateAbbr == 'NY') %>% # measureid
  dplyr::transmute(GEOID = LocationName,
                   state = StateAbbr, # stateabbr
                   chd_prev = Data_Value/100) # data_value

C.2. Merge Data

merged <- acs_data_geo %>% 
  dplyr::left_join(places_cvd, by = 'GEOID')

D. Exploring the Data

D.1. Summarize with a table

merged %>% 
  skimr::skim(hhincome, perc_lessthanhs, perc_hs, perc_somecollege, perc_bachelors) 
Data summary
Name Piped data
Number of rows 5411
Number of columns 9
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
hhincome 213 0.96 82543.44 40423.44 2499 55325.50 74980.00 101862.00 250001 ▃▇▃▁▁
perc_lessthanhs 115 0.98 0.13 0.10 0 0.05 0.10 0.17 1 ▇▂▁▁▁
perc_hs 115 0.98 0.26 0.11 0 0.19 0.27 0.34 1 ▃▇▁▁▁
perc_somecollege 115 0.98 0.25 0.09 0 0.19 0.25 0.31 1 ▃▇▁▁▁
perc_bachelors 115 0.98 0.36 0.20 0 0.21 0.33 0.48 1 ▅▇▅▂▁

D.2. Summarize with box and violin plots

# Income
merged %>% 
  ggplot(aes(y = hhincome, x = "")) + 
  geom_violin() +
  geom_boxplot(width = 0.2) + 
  labs(x = "", y = "Median Household Income")
## Warning: Removed 213 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 213 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Educational attainment
merged_long_edu <- merged %>% 
  tidyr::pivot_longer(cols = -c(GEOID, state, geometry, hhincome),
                      names_to = 'variable_name',
                      values_to = 'value') 

merged_long_edu %>% 
  ggplot(aes(y = value, x = "")) + 
  geom_violin() +
  geom_boxplot(width = 0.2) +
  facet_grid(. ~ variable_name) + 
  labs(x = "", y = "Proportion")
## Warning: Removed 1533 rows containing non-finite outside the scale range
## (`stat_ydensity()`).
## Warning: Removed 1533 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

E. Mapping the Data

Use the ggplot::geom_sf() in order to plot the Census tract boundaries for New York.

merged %>% 
  ggplot() +
  ggplot2::geom_sf()

E.1. Creating a Choropleth Map for a Single Variable

A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the value of a variable being represented. This type of map is commonly used to visualize how a measurement varies across a geographic area, such as population density or median income.

… Adding in a variable to create choropleth map

In ggplot::geom_sf(), use the fill argument in the aes() function to specify a variable that you would like to visualize In our example, we are visualizing the geographic distribution of those who have a percentage high school education.

smap_1 <- merged %>% 
  ggplot() +
  geom_sf(aes(fill = hhincome))

smap_1

… Change legend colors

Use the low and high arguments in the scale_fill_continuous() to specify colors to use for low and high values on a continuous scale.

smap_2 <- smap_1 + 
  scale_fill_continuous(low = 'white', high = 'blue')

smap_2

… Changing map theme

Use the theme_void() function to remove the background and axis marks.

smap_3 <- smap_2 +
  theme_void()

smap_3

… add a title

Use the title argument in the labs() function in order to create a title for the map.

smap_4 <- smap_3 +
  labs(title = 'High School Educational Attainment in New York')

smap_4

… adding in a north arrow and scale bar

Use the annotation_north_arrow() and annotation_scale() functions from the ggspatial package to add in a north arrow and scalebar to the map - best practices when building a map.

  • The location = tl argument in annotation_north_arrow() function is used to place the north arrow on the top left area of the map.
  • The unit_category = 'imperial' argument in annotation_scale() function is used to change the units in the scale bar from metric to imperial units (showing miles instead of kilometers).
smap_5 <- smap_4 +
  ggspatial::annotation_north_arrow(location = 'tl') + # tl for top-left
  ggspatial::annotation_scale(unit_category = 'imperial') # imperial units to show miles instead of km

smap_5

Putting it all together

We can create the map using the previous functions in one step.

smap <- merged %>% 
  ggplot() +
  ggplot2::geom_sf() +
  geom_sf(aes(fill = perc_hs)) + 
  scale_fill_continuous(low = 'white', high = 'blue') + 
  theme_void() +
  labs(title = 'High School Educational Attainment in New York') +
  ggspatial::annotation_north_arrow(location = 'tl') +
  ggspatial::annotation_scale(unit_category = 'imperial')

smap

E.2. Creating Choropleth Maps for Multiple Variables

Prepare long dataset with CHD and educational variables

Long datasets are easier to use with ggplot2.

A long dataset, also known as a “tidy” dataset, is structured so that each row represents a single observation, with one column for each variable and an additional column for the values. This format is often used for time series or repeated measures data.

In contrast, a wide dataset has a format where each row represents a single subject or entity, with multiple columns for different measurements or observations taken at different times or under different conditions. This format is common in datasets where comparisons across different time points or conditions are needed.

We use the tidyr::pivot_longer() function in order to create one column for the values of the CHD and educational attainment variables in our dataset.

merged_long <- merged %>% 
  tidyr::pivot_longer(cols = c(chd_prev, perc_lessthanhs, perc_hs, perc_somecollege, perc_bachelors),
                      names_to = 'variable_name',
                      values_to = 'value') 

Create a Facet Plot

Use the facet_wrap() function to create a facet plot (allows us to create multiple maps in a grid) for our 5 variables. We use the ncol = 3 argument to specify that we would like to create a grid of maps with 3 columns.

multmap_1 <- merged_long %>% 
  ggplot() +
  ggplot2::geom_sf() +
  geom_sf(aes(fill = value)) +
  facet_wrap(vars(variable_name), ncol = 3)

multmap_1

… Adding in different legend colors, theme, title, north arrow, and scalebar

Use the same functions as we used with the single variable choropleth map in order to add different legend colors, theme, a title, a north arrow, and a scalebar.

multmap_2 <- multmap_1 +
  scale_fill_continuous(low = 'white', high = 'blue') +
  theme_void() +
  labs(title = 'High School Educational Attainment in New York') +
  ggspatial::annotation_north_arrow(location = 'tl') +
  ggspatial::annotation_scale(unit_category = 'imperial')

multmap_2

Putting it all together

We can create the map using the previous functions in one step.

multmap <- merged_long %>% 
  ggplot() +
  ggplot2::geom_sf() +
  geom_sf(aes(fill = value)) +
  facet_wrap(vars(variable_name), ncol = 3) + 
  scale_fill_continuous(low = 'white', high = 'blue') + 
  theme_void() +
  labs(title = 'High School Educational Attainment in New York') +
  ggspatial::annotation_north_arrow(location = 'tl') +
  ggspatial::annotation_scale(unit_category = 'imperial')

multmap

E.3. Combining Multiple Maps

Using the patchwork package, we can use the / operator to stack our previously created maps, one on top of the other.

smap / multmap

F. Additional Resources

Below are some additional references and resources: